Conversation

@oknet
Member

@oknet oknet commented Oct 10, 2018

The vc->add_to_keep_alive_queue() is called by:

  • Http1ClientSession::release()
  • Http2ClientSession::cleanup_streams()
  • Http2ClientSession::release_stream()

And the NetHandler::remove_from_active_queue() is called by
vc->add_to_keep_alive_queue().

The NetHandler::keep_alive_queue and NetHandler::active_queue are
internal queues of NetHandler. We must have acquired the NetHandler's
lock before doing anything with them.

When reenabling an HttpSM from a plugin, the HttpSM may run on an
ethread different from the one the client_vc lives on, because
TSHttpSMCallback is scheduled by eventProcessor.schedule_imm().

@oknet oknet requested a review from SolidWallOfCode October 10, 2018 13:06
@oknet oknet self-assigned this Oct 10, 2018
@oknet oknet added this to the 9.0.0 milestone Oct 10, 2018
@oknet oknet requested a review from scw00 October 10, 2018 13:08
Member

@mingzym mingzym left a comment


Great catch, hope this will help us finish the long standing mysterious issues in the plugin & transaction system

@shinrich
Member

We were also seeing some crashes with the active_queue_link, resulting in stack traces like:

#0  0x0000000000808663 in DLL<UnixNetVConnection, UnixNetVConnection::Link_active_queue_link>::remove (this=0x2b76793fadd8, e=0x2b78900dc700)
    at /var/builds/workspace/157188-v3-component/BUILD_CONTAINER/rhel7-gcc6_2/label/DOCKER-HIGH/app_root/_build/ats_build/../../trafficserver/lib/ts/List.h:285
#1  0x000000000080815a in Queue<UnixNetVConnection, UnixNetVConnection::Link_active_queue_link>::remove (this=0x2b76793fadd8, e=0x2b78900dc700)
    at /var/builds/workspace/157188-v3-component/BUILD_CONTAINER/rhel7-gcc6_2/label/DOCKER-HIGH/app_root/_build/ats_build/../../trafficserver/lib/ts/List.h:390
#2  0x000000000080734e in NetHandler::remove_from_active_queue (this=0x2b76793facf0, vc=0x2b78900dc700) at ../../../../trafficserver/iocore/net/UnixNet.cc:748
#3  0x0000000000806fd3 in NetHandler::add_to_keep_alive_queue (this=0x2b76793facf0, vc=0x2b78900dc700) at ../../../../trafficserver/iocore/net/UnixNet.cc:698
#4  0x0000000000812a6a in UnixNetVConnection::add_to_keep_alive_queue (this=0x2b78900dc700) at ../../../../trafficserver/iocore/net/UnixNetVConnection.cc:1549
#5  0x00000000006ce7fc in Http2ConnectionState::release_stream (this=0x2b80659d6f50, stream=0x2b7d8a87ad30) at ../../../../trafficserver/proxy/http2/Http2ConnectionState.cc:1238
#6  0x00000000006bf399 in Http2Stream::destroy (this=0x2b7d8a87ad30) at ../../../../trafficserver/proxy/http2/Http2Stream.cc:716
#7  0x00000000006bd6fd in Http2Stream::terminate_if_possible (this=0x2b7d8a87ad30) at ../../../../trafficserver/proxy/http2/Http2Stream.cc:377
#8  0x00000000006bd63b in Http2Stream::transaction_done (this=0x2b7d8a87ad30) at ../../../../trafficserver/proxy/http2/Http2Stream.cc:366
#9  0x000000000066a16b in HttpSM::kill_this (this=0x2b8104610e00) at ../../../../trafficserver/proxy/http/HttpSM.cc:7078
#10 0x000000000065b2af in HttpSM::main_handler (this=0x2b8104610e00, event=2301, data=0x2b8104612180) at ../../../../trafficserver/proxy/http/HttpSM.cc:2826
#11 0x0000000000831494 in Continuation::dispatchEvent (this=0x2b8104610e00, event=2301, data=0x2b8104612180) at ../../../../trafficserver/iocore/eventsystem/Continuation.cc:46
#12 0x00000000006b5f32 in HttpTunnel::main_handler (this=0x2b8104612180, event=103, data=0x2b807267f330) at ../../../../trafficserver/proxy/http/HttpTunnel.cc:1650
#13 0x00000000008313ad in Continuation::handleEvent (this=0x2b8104612180, event=103, data=0x2b807267f330) at ../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#14 0x00000000007c13c3 in CacheVC::calluser (this=0x2b807267f0e0, event=103) at ../../../../trafficserver/iocore/cache/P_CacheInternal.h:639
#15 0x00000000007ca0fd in CacheVC::openWriteMain (this=0x2b807267f0e0) at ../../../../trafficserver/iocore/cache/CacheWrite.cc:1353
#16 0x00000000008313ad in Continuation::handleEvent (this=0x2b807267f0e0, event=1, data=0x2b7feafb7060) at ../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#17 0x00000000008334e3 in EThread::process_event (this=0x2b7678a82010, e=0x2b7feafb7060, calling_code=1) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:132
#18 0x00000000008336ba in EThread::process_queue (this=0x2b7678a82010, NegativeQueue=0x2b767b964e10, ev_count=0x2b767b964e08, nq_count=0x2b767b964e0c) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:171
#19 0x0000000000833982 in EThread::execute_regular (this=0x2b7678a82010) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:231
#20 0x0000000000833d01 in EThread::execute (this=0x2b7678a82010) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:326
#21 0x0000000000832936 in spawn_thread_internal (a=0x25b3ee0) at ../../../../trafficserver/iocore/eventsystem/Thread.cc:85
#22 0x00002b7671183dc5 in start_thread () from /lib64/libpthread.so.0
#23 0x00002b7671eb276d in clone () from /lib64/libc.so.6

And digging in, we saw that an nh from another thread was showing up in frame 3 of the stack above.

My current theory is that the problem was caused by reusing a stale client_vc. And I thought we fixed it by removing release_netvc ("Fix crash in Http2ClientSession::release_netvc" on master). Since deploying this fix, we are no longer seeing the active_queue crash.

The TSHttpSMCallback is scheduled via schedule_imm from TSHttpTxnReenable, which should schedule the TSHttpSMCallback on the same thread that called the Reenable. I guess for many plugins this would be a Task thread. Is that your scenario? If so, your fix seems reasonable, but I'm a bit leery of the blocking locks.

@duke8253 is working on thread affinity. I think we already have sufficient information in this case that we can just schedule on the thread associated with the HttpSM. Given an HttpSM sm, we should be able to call sm->ua_txn->adjust_thread(cont, event, data)

@oknet
Member Author

oknet commented Oct 11, 2018

@shinrich Yes, I found the internal queues of NetHandler are accessed from another thread without acquiring the NetHandler's lock.

If you are a bit leery of the blocking locks, I can just leave the ink_release_assert check.

@oknet oknet changed the title from "Make operations on keep_alive_queue and active_queue are thread-safe" to "Assert if operations on keep_alive_queue and active_queue are not thread-safe" Oct 11, 2018
@oknet
Member Author

oknet commented Oct 11, 2018

[approve ci autest]

@scw00
Member

scw00 commented Oct 11, 2018

@oknet indicated that we cannot take the nh lock directly since it might cause a deadlock (the nh first takes its own lock and then acquires the vc lock; now the vc may access the nh lock while holding the vc's lock).

@oknet
Member Author

oknet commented Oct 11, 2018

The autest always fails on the H2 tests after leaving only the ink_release_assert here.

I think there are some bugs related to the H2 code. @masaori335

ccid_ctid, double_h2 and http2_priority failed:

The mutex->thread_holding is nullptr.

Fatal: UnixNet.cc:676: failed assertion `mutex->thread_holding == this_ethread()`
[Oct 11 07:22:15.717] {0x2b2da241ffc0} NOTE: crashlog started, target=13396, debug=false syslog=true, uid=989 euid=989
[Oct 11 07:22:15.718] {0x2b2da241ffc0} WARNING: failed to intialize management API: [5] Error establishing socket connection.
[Oct 11 07:22:15.718] {0x2b2da241ffc0} NOTE: logging to 0x8b9260
[Oct 11 07:22:15.719] {0x2b2da241ffc0} ERROR: wrote crash log to /var/tmp/ausb-4379.7080/ccid_ctid/ts/log/crash-2018-10-11-072215.log
traffic_server: received signal 6 (Aborted)
traffic_server - STACK TRACE: 
traffic_server(_Z19crash_logger_invokeiP9siginfo_tPv+0xc3)[0x58e9a0]
/lib64/libpthread.so.0(+0x11fb0)[0x7f4bdda8cfb0]
/lib64/libc.so.6(gsignal+0x10b)[0x7f4bdcbb0eab]
/lib64/libc.so.6(abort+0x123)[0x7f4bdcb9b5b9]
/var/jenkins/workspace/autest-github/7080/install/lib/libtscore.so.9(_Z11ink_warningPKcz+0x0)[0x7f4bdf7c6ad8]
/var/jenkins/workspace/autest-github/7080/install/lib/libtscore.so.9(_Z17ats_base64_encodePKhmPcmPm+0x0)[0x7f4bdf7c3ed0]
traffic_server(_ZN10NetHandler23add_to_keep_alive_queueEP18UnixNetVConnection+0xba)[0x7f007c]
traffic_server(_ZN18UnixNetVConnection23add_to_keep_alive_queueEv+0x26)[0x7fca0e]
traffic_server(_ZN20Http2ConnectionState14release_streamEP11Http2Stream+0x121)[0x6a9e43]
traffic_server(_ZN11Http2Stream7destroyEv+0x29b)[0x696b1d]
traffic_server(_ZN11Http2Stream21terminate_if_possibleEv+0xbc)[0x694aec]
traffic_server(_ZN11Http2Stream18main_event_handlerEiPv+0x9b4)[0x693e5a]
traffic_server(_ZN12Continuation11handleEventEiPv+0xc9)[0x59289d]
traffic_server(_ZN7EThread13process_eventEP5Eventi+0x12f)[0x823da3]
traffic_server(_ZN7EThread13process_queueEP5QueueI5EventNS1_9Link_linkEEPiS5_+0xd3)[0x823ff5]
traffic_server(_ZN7EThread15execute_regularEv+0x1ff)[0x8242b9]
traffic_server(_ZN7EThread7executeEv+0xf5)[0x8246b9]
traffic_server[0x82317d]
/lib64/libpthread.so.0(+0x7594)[0x7f4bdda82594]
/lib64/libc.so.6(clone+0x3f)[0x7f4bdcc73e6f]

@oknet
Member Author

oknet commented Oct 11, 2018

openclose failed:

Fatal: UnixNet.cc:710: failed assertion `mutex->thread_holding == this_ethread()`
[Oct 11 07:25:32.082] {0x2b0ca3ee7fc0} NOTE: crashlog started, target=15194, debug=false syslog=true, uid=989 euid=989
[Oct 11 07:25:32.082] {0x2b0ca3ee7fc0} WARNING: failed to intialize management API: [5] Error establishing socket connection.
[Oct 11 07:25:32.082] {0x2b0ca3ee7fc0} NOTE: logging to 0xc9c260
[Oct 11 07:25:32.083] {0x2b0ca3ee7fc0} ERROR: wrote crash log to /var/tmp/ausb-4379.7080/openclose/ts/log/crash-2018-10-11-072532.log
traffic_server: received signal 6 (Aborted)
traffic_server - STACK TRACE: 
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(_Z19crash_logger_invokeiP9siginfo_tPv+0xc3)[0x58e9a0]
/lib64/libpthread.so.0(+0x11fb0)[0x2b552e989fb0]
/lib64/libc.so.6(gsignal+0x10b)[0x2b552f70feab]
/lib64/libc.so.6(abort+0x123)[0x2b552f6fa5b9]
/var/jenkins/workspace/autest-github/7080/install/lib/libtscore.so.9(_Z11ink_warningPKcz+0x0)[0x2b552ccaead8]
/var/jenkins/workspace/autest-github/7080/install/lib/libtscore.so.9(_Z17ats_base64_encodePKhmPcmPm+0x0)[0x2b552ccabed0]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(_ZN10NetHandler19add_to_active_queueEP18UnixNetVConnection+0x148)[0x7f0368]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(_ZN18UnixNetVConnection19add_to_active_queueEv+0x26)[0x7fca62]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(_ZN18Http1ClientSession15new_transactionEv+0xae)[0x5e80ce]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(_ZN18Http1ClientSession7releaseEP22ProxyClientTransaction+0x19b)[0x5e7e25]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(_ZN18Http1ClientSession5startEv+0x33)[0x5e8d63]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(_ZN18ProxyClientSession17handle_api_returnEi+0xbe)[0x7bd326]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(_ZN18ProxyClientSession17state_api_calloutEiPv+0x30c)[0x7bd128]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(_ZN12Continuation11handleEventEiPv+0xc9)[0x59289d]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(TSHttpSsnReenable+0x13f)[0x5a089a]
/var/tmp/ausb-4379.7080/openclose/ts/plugin/ssntxnorder_verify.so(+0xb4c8)[0x2b557000b4c8]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(_ZN15INKContInternal12handle_eventEiPv+0x10e)[0x59746c]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(_ZN12Continuation13dispatchEventEiPv+0x107)[0x821cc9]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(_ZN7APIHook6invokeEiPv+0x76)[0x597d82]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(_ZN18ProxyClientSession17state_api_calloutEiPv+0x2cb)[0x7bd0e7]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(_ZN12Continuation11handleEventEiPv+0xc9)[0x59289d]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x12f)[0x823da3]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(_ZN7EThread15execute_regularEv+0x2e8)[0x8243a2]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server(_ZN7EThread7executeEv+0xf5)[0x8246b9]
/var/tmp/ausb-4379.7080/openclose/ts/bin/traffic_server[0x82317d]

I will create another PR to fix it.

…ead-safe

The `vc->add_to_keep_alive_queue()` is called by:
  - Http1ClientSession::release()
  - Http2ClientSession::cleanup_streams()
  - Http2ClientSession::release_stream()

And the `NetHandler::remove_from_active_queue()` is called by
`vc->add_to_keep_alive_queue()`.

The NetHandler::keep_alive_queue and NetHandler::active_queue are
internal queues of NetHandler. We must have acquired the NetHandler's
lock before doing anything with them.

When reenabling an HttpSM from a plugin, the HttpSM may run on an
ethread different from the one the client_vc lives on, because
TSHttpSMCallback is scheduled by `eventProcessor.schedule_imm()`.
@SolidWallOfCode
Member

My view would be to fix it in TSHttpTxnReenable: if the call is not on the same thread as the HttpSM and its NetHandler, reschedule it on the correct thread.

@zwoop
Contributor

zwoop commented Oct 11, 2018

Running this on docs.trafficserver.

@shinrich
Member

For the asserts, it would be interesting to see whether the lock wasn't being held at all, or was held by another thread. For the autest stack traces I wouldn't be surprised if the nh mutex just wasn't held.

@oknet
Member Author

oknet commented Oct 12, 2018

@SolidWallOfCode @shinrich @zwoop @scw00

We should choose a rule for operating on the keep_alive_queue and active_queue:

  1. The queues may only be accessed from the current ethread:
  • ink_assert(vc->thread == nh->thread);
  2. We must have acquired the NetHandler's lock before doing anything with the queues:
  • ink_assert(nh->mutex->thread_holding == this_ethread());

@shinrich
Member

I think we should fix TSHttpTxnReenable to schedule onto the correct ET_NET thread (satisfying option 1) rather than some arbitrary thread in the ET_NET pool. It should still hold the lock.

@zwoop zwoop added the OnDocs This is for PR currently running, or will run, on the Docs ATS server label Oct 19, 2018
@oknet
Member Author

oknet commented Dec 12, 2018

@zwoop any assert reported on docs?

@zwoop
Contributor

zwoop commented Jan 8, 2019

No asserts on Docs :/ Do we want this PR into 7? If so, can you please make a "git cherry-pick -x" backport into the 7.1.x branch?

Update: of course, don't make a PR against 7.1.x until this has landed on master!

Member

@scw00 scw00 left a comment


I'd like to approve this PR even if it may cause crashes.

@zwoop
Contributor

zwoop commented Jan 11, 2019

This is cherry-picked into 7.1.6 via #4787

@zwoop zwoop removed OnDocs This is for PR currently running, or will run, on the Docs ATS server labels Feb 13, 2019
zwoop added a commit to zwoop/trafficserver that referenced this pull request Mar 14, 2019
… not thread-safe"

This reverts commit 3599e5e.

Seems this causes new, unpredictable crashes. For now, let's revert this, and we can
keep working on it for 8.x / 9.x. Reference apache#4379, apache#4921 and apache#5120.
zwoop added a commit that referenced this pull request Mar 14, 2019
… not thread-safe"

This reverts commit 3599e5e.

Seems this causes new, unpredictable crashes. For now, let's revert this, and we can
keep working on it for 8.x / 9.x. Reference #4379, #4921 and #5120.
@bryancall bryancall modified the milestones: 9.0.0, 8.1.0 Mar 27, 2019
@bryancall
Contributor

Cherry picked to 8.1.0
